Before we begin, let us load some packages that we will potentially be using. For this assignment, we will be wrangling part of the gapminder dataset.
library(gapminder)
library(tidyverse)
library(dplyr)
library(forcats)
library(ggplot2)
library(tibble)
library(DT)
library(plotly)
library(gridExtra)
library(scales)
The here::here package in RStudio projects is a great package that has its primary advantages when sharing code with collaborators. When using other methods to specify the directory in your code, the code may malfunction due to the collaborators using different operating systems. This package also replaces the need to specify the working directory in the code, which again would cause the code to malfunction if collaborators are trying to run the code without having the same file path expressed on their computer. Other advantages to the package include the ease in which sub-directories can be managed and even used if the files are opened outside of a project on RStudio.
Let’s explore the gapminder dataset. Before beginning, let’s verify that a variable within the dataset, such as continent, is a factor.
gapminder$continent %>%
class()
## [1] "factor"
The variable continent is indeed a factor.
Let’s look at how many levels there are within this factor, and let’s identify the different levels.
gapminder$continent %>%
nlevels()
## [1] 5
gapminder$continent %>%
levels()
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
There are 5 levels, which are listed above; the number of entries for each continent can be found in the table below.
gapminder %>%
count(continent) %>%
datatable()
There are relatively few entries for Oceania, so let’s remove Oceania from the dataset. First, let’s filter through the dataset to remove observations from Oceania.
Ex2 <- c("Africa", "Americas", "Asia", "Europe")
Ex2_filter <- gapminder %>%
filter(continent %in% Ex2)
Ex2_filter %>%
count(continent) %>%
datatable()
Ex2_filter$continent %>%
nlevels()
## [1] 5
Ex2_filter$continent %>%
levels()
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
Oceania was successfully filtered from the dataset! However, we can see that the number of levels didn’t change, and Oceania is still listed as a level; Oceania is now considered an unused level. Let’s drop this unused level.
Ex2_dropped <- Ex2_filter %>%
droplevels()
Ex2_dropped$continent %>%
nlevels()
## [1] 4
Ex2_dropped$continent %>%
levels()
## [1] "Africa" "Americas" "Asia" "Europe"
We successfully dropped the unused levels, and Oceania is no longer listed as a level.
Now that we have removed Oceania, let’s look at the standard deviation of life expectancy in the remaining data and display them in a visual manner using boxplots.
Ex2_P2 <- Ex2_dropped %>%
group_by(continent) %>%
summarize(sigma = sd(lifeExp))
Ex2_P2_r = Ex2_P2
Ex2_P2_r$sigma = round(Ex2_P2_r$sigma, 1)
datatable(Ex2_P2_r)
Ex2_dropped %>%
select(continent, lifeExp) %>%
arrange(continent) %>%
ggplot(aes(continent, lifeExp)) +
geom_boxplot() +
ylab("Life expectancy") +
xlab("Continent") +
theme_bw()
Let’s reorder the boxplots based on the standard deviation.
Ex2_dropped %>%
select(continent, lifeExp) %>%
arrange(continent) %>%
ggplot(aes(fct_relevel(continent, "Asia", "Americas", "Africa", "Europe"), lifeExp)) +
geom_boxplot() +
ylab("Life expectancy") +
xlab("Continent") +
theme_bw()
Now the boxes become tighter from left to right!
Let’s create a dataset with only data from Poland. Only data about the year, life expectancy, population, and GDP per capita will be kept since the country and continent will be redundant.
Poland <- gapminder %>%
filter(country == "Poland") %>%
select(year, lifeExp, pop, gdpPercap)
write_csv(Poland, here::here("hw05", "Poland.csv"))
Now that we have created a CSV file in our working directories, let’s read in that CSV file so we can see the data!
Poland_file = here::here("hw05", "Poland.csv")
Poland_new <- read_csv(Poland_file)
datatable(Poland_new)
Many figures were created while completing previous assignments. However, in light of recent developments, let’s go back to Assignment# and recreate the figure from Task#, showing both the original figure with the new figure side-by-side.
Ex3.3_o <- gapminder %>%
group_by(continent) %>%
select(continent, gdpPercap) %>%
arrange(continent) %>%
ggplot(aes(continent, gdpPercap)) +
geom_boxplot() +
ylab("GDP Per Capita") +
scale_y_continuous(labels = dollar_format()) +
xlab("Continent") +
theme_bw()
Ex3.3_n <- gapminder %>%
group_by(continent) %>%
select(continent, gdpPercap) %>%
arrange(continent) %>%
ggplot(aes(continent, gdpPercap)) +
geom_boxplot() +
scale_y_continuous(labels = dollar_format()) +
ggtitle("GDP per capita by continent") +
xlab("") +
ylab("") +
theme_bw() +
theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(colour = "grey"))
Ex3.3_n2 <- ggplotly(Ex3.3_n)
Ex3.3_o
Ex3.3_n2
The most notable difference with the second figure is that the second figure is interactive; it also includes a title , which incorporates both axes, which would be redundant to label. The distracting gridlines (both major and minor) have been removed; the top and right axes borders have been removed and have been lightened to grey for minimal distraction.
Let’s save the new plot that we just made in Exercise 4 as a separate file.